Clustering Algorithms and Validity Measures

نویسندگان

  • Maria Halkidi
  • Yannis Batistakis
  • Michalis Vazirgiannis
چکیده

Clustering aims at discovering groups and identifying interesting distributions and patterns in data sets. Researchers have extensively studied clustering since it arises in many application domains in engineering and social sciences. In the last years the availability of huge transactional and experimental data sets and the arising requirements for data mining created needs for clustering algorithms that scale and can be applied in diverse domains. This paper surveys clustering methods and approaches available in literature in a comparative way. It also presents the basic concepts, principles and assumptions upon which the clustering algorithms are based. Another important issue is the validity of the clustering schemes resulting from applying algorithms. This is also related to the inherent features of the data set under concern. We review and compare clustering validity measures available in the literature. Furthermore, we illustrate the issues that are underaddressed by the recent algorithms and we address new research directions.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Application of Probabilistic Clustering Algorithms to Determine Mineralization Areas in Regional-Scale Exploration Studies

In this work, we aim to identify the mineralization areas for the next exploration phases. Thus, the probabilistic clustering algorithms due to the use of appropriate measures, the possibility of working with datasets with missing values, and the lack of trapping in local optimal are used to determine the multi-element geochemical anomalies. Four probabilistic clustering algorithms, namely PHC,...

متن کامل

ارائه یک الگوریتم خوشه بندی برای داده های دسته ای با ترکیب معیارها

Clustering is one of the main techniques in data mining. Clustering is a process that classifies data set into groups. In clustering, the data in a cluster are the closest to each other and the data in two different clusters have the most difference. Clustering algorithms are divided into two categories according to the type of data: Clustering algorithms for numerical data and clustering algor...

متن کامل

Evaluation of Internal Validity Measures in Short-Text Corpora

Short texts clustering is one of the most difficult tasks in natural language processing due to the low frequencies of the document terms. We are interested in analysing these kind of corpora in order to develop novel techniques that may be used to improve results obtained by classical clustering algorithms. In this paper we are presenting an evaluation of different internal clustering validity...

متن کامل

خوشه‌بندی خودکار داده‌ها با بهره‌گیری از الگوریتم رقابت استعماری بهبودیافته

Imperialist Competitive Algorithm (ICA) is considered as a prime meta-heuristic algorithm to find the general optimal solution in optimization problems. This paper presents a use of ICA for automatic clustering of huge unlabeled data sets. By using proper structure for each of the chromosomes and the ICA, at run time, the suggested method (ACICA) finds the optimum number of clusters while optim...

متن کامل

A partition-based algorithm for clustering large-scale software systems

Clustering techniques are used to extract the structure of software for understanding, maintaining, and refactoring. In the literature, most of the proposed approaches for software clustering are divided into hierarchical algorithms and search-based techniques. In the former, clustering is a process of merging (splitting) similar (non-similar) clusters. These techniques suffered from the drawba...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001